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Abstract 

Background: Wlien multiple prognostic factors are adjusted for in the analysis of a randomised trial, it is unclear (1) 
whether it is necessary to account for each of the strata, formed by all combinations of the prognostic factors 
(stratified analysis), when randomisation has been balanced within each stratum (stratified randomisation), or 
whether adjusting for the main effects alone will suffice, and (2) the best method of adjustment in terms of type I 
error rate and power, irrespective of the randomisation method. 

Methods: We used simulation to (1) determine if a stratified analysis is necessary after stratified randomisation, and 
(2) to compare different methods of adjustment in terms of power and type I error rate. We considered the 
following methods of analysis: adjusting for covariates in a regression model, adjusting for each stratum using 
either fixed or random effects, and Mantel-Haenszel or a stratified Cox model depending on outcome. 

Results: Stratified analysis is required after stratified randomisation to maintain correct type I error rates when (a) 
there are strong interactions between prognostic factors, and (b) there are approximately equal number of patients 
in each stratum. However, simulations based on real trial data found that type I error rates were unaffected by the 
method of analysis (stratified vs unstratified), indicating these conditions were not met in real datasets. Comparison 
of different analysis methods found that with small sample sizes and a binary or time-to-event outcome, most 
analysis methods lead to either inflated type I error rates or a reduction in power; the lone exception was a 
stratified analysis using random effects for strata, which gave nominal type I error rates and adequate power. 

Conclusions: It is unlikely that a stratified analysis is necessary after stratified randomisation except in extreme 
scenarios. Therefore, the method of analysis (accounting for the strata, or adjusting only for the covariates) will not 
generally need to depend on the method of randomisation used. Most methods of analysis work well with large 
sample sizes, however treating strata as random effects should be the analysis method of choice with binary or 
time-to-event outcomes and a small sample size. 

Keywords: Randomised controlled trial. Stratified randomisation. Restricted randomisation, Covariate adjusted 
analysis. Stratified analysis 



Background 

Some randomised controlled trials (RCTs) adjust their 
analyses for prognostic factors which are thought to influ- 
ence outcome (such as age or disease stage). This is com- 
monly done to increase power [1-7], to guard against 
chance imbalances between treatment arms [3,8], or be- 
cause the prognostic factors have been used as balancing 
variables in the randomisation process and it is necessary 
to account for them in the analysis to obtain correct type I 
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error rates [9-14]. There are often several available 
methods to account for covariates in a trial analysis. For 
example, with a binary outcome either logistic regression 
with the prognostic factors as covariates or a Mantel- 
Haenszel technique may be used to estimate the treatment 
effect. Alternatively, one could adjust for the individual 
strata, formed by all combinations of the covariates, using 
a logistic regression model with the strata modelled as ei- 
ther fixed or random effects. 

Although many articles have highlighted the benefits of 
covariate adjustment [1-8], relatively little attention has 
been paid to the best method of adjustment with multiple 
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prognostic factors (with only one prognostic factor most 
methods of adjustment will give a similar answer [3]). Like- 
wise, little research has looked into whether the type of ad- 
justment should match the type of randomisation (e.g. a 
stratified analysis for stratified randomisation). When ran- 
domisation is carried out within each stratum (for example 
using stratified permuted blocks [15]), not only wOl each 
stratification factor be balanced between treatment arms, 
but each combination of stratification factors (i.e. each 
stratum) will as well. If there is an interaction between bal- 
ancing factors (e.g. if the effect of age on outcome depends 
on the patient's disease stage), it may be necessary to ac- 
count for not only the stratification factors, but also their 
interactions (or for each stratum) in the analysis in order 
to obtain correct type I error rates [9]. Conversely, when 
randomised is not carried out within strata (e.g. when bal- 
ancing factors are not used in the randomisation process, 
or when covariates are balanced marginally), it may be un- 
necessary to account for interactions between covariates to 
obtain correct type I error rates (although adjustment for 
strong interactions may lead to increased power). Mini- 
misation [15] is the most commonly used method of bal- 
ancing covariates marginally [10]. Briefly, the balance 
between treatment groups is calculated for each covariate, 
and then summed across all covariates to give an overall 
measure of balance. The patient is then allocated to the 
group that would give the best overall balance (usually with 
an element of probability [16]). Thus, treatment assign- 
ments are balanced within a covariate considered individu- 
ally, but not within combinations of balancing covariates. 
This implies that for minimisation, adjusting only for the 
covariates used in the minimisation process (and not for 
their interactions, or equivalently each strata) should give 
valid type I error rates. 

The goals of this paper are (1) to determine whether 
a stratified analysis is necessary to maintain correct 
type I error rates after randomisation is performed 
within each stratum (stratified randomisation), and (2) 
to compare different methods of adjustment in terms 
of type I error rate and power, irrespective of the 
method of randomisation used. We do not consider the 
issue of how best to adjust for centre-effects in 
multicentre RCTs, as this issue has different consider- 
ations, and has been discussed previously [3,14,17-21]. 
We also do not consider the case of treatment-by-co- 
variate (or treatment-by-strata) interaction (that is, all 
analysis methods presented here assume the treatment 
effect is constant across different covariates or strata). 

Methods 

Methods of adjustment 

Methods of adjustment for covariates will generally fall 
into two classes. The first involves adjusting only for the 



individual covariates. We refer to this as a covariate-ad- 
justed analysis. The second involves adjusting for each 
individual stratum. We refer to this as a stratified ana- 
lysis. We illustrate these two different approaches using 
an example. Suppose we wish to account for two binary 
covariates in the analysis: gender (male vs. female) and 
disease stage (early vs. late). A covariate-adjusted ana- 
lysis would involve adjusting for gender and disease 
stage as two separate variables in a regression model. A 
stratified analysis however would account for each 
stratum formed by these covariates (male/early stage vs 
male/late stage vs female/early stage vs female/late 
stage). This could be done by adjusting for each stratum 
in a regression model using three dummy variables, but 
is often done by performing the analysis within each 
stratum, and combining the results. Examples of this in- 
clude Mantel-Haenzel for binary outcomes, or a strati- 
fied Cox model for time-to-event outcomes. 

We illustrate some general approaches to adjustment 
using the previous example, where there were two covari- 
ates of interest, gender and disease stage. Let Xq be a bin- 
ary variable indicating the patients gender (0 = female, 1 = 
male), Xs be a binary variable indicating the disease stage 
(0 = early, 1 = late), and Xty^^t be a binary variable indicat- 
ing whether the patient received the treatment or not. 

Then, a covariate-adjusted analysis can be carried out 
by adjusting for the individual covariates in a regression 
model. This can be done with a model of the form: 

f{Y)=a + li,,,,,X,reat + I^G^G + ^s^S (1) 

Where Y is the patient outcome, f(. ) is the link func- 
tion, a the intercept, and the /S's represent regression 
coefficients. 

There are several methods of performing a stratified 
analysis, some of which apply only to certain outcome 
types. One method of performing a stratified analysis is 
to account for all the strata in a regression model using 
indicator variables. This can be thought of as a stratified 
analysis using fixed effects (this is a common method of 
analysis for multicentre trials with continuous outcomes 
[14]). Since gender and disease stage form four strata 
(female/early, female/late, male/early, male/late) we need 
three dummy variables. Let Xpi be a binary variable indi- 
cating whether the patient was female with late-stage 
disease (0 = no, 1 = yes), Xme be a binary variable indicat- 
ing whether the patient was male with early-stage dis- 
ease (0 = no, 1 = yes), and Xml be a binary variable 
indicating whether the patient was male with late-stage 
disease (0 = no, 1 = yes) (the choice of which stratum to 
drop from the parameterisation is arbitrary, and will 
have no impact on the estimated treatment effect or its 
standard error). 
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Then, a stratified analysis using fixed effects can be 
performed using the following model: 

f{Y) = a+ Pt^g^i-Xtreat + PpL^FL + PmE^ME + PmL^ML 

(2) 

It should be noted that (provided there are no continu- 
ous covariates) equation (2) is equivalent to adjusting for 
all the individual covariates as well as all interactions in 
the sense that the treatment effect and its standard error 
will be identical. 

Another method of performing a stratified analysis is 
to treat strata as random effects from a distribution 
(which is also sometimes used to analyse multicentre tri- 
als [14]). This can be thought of as a stratified analysis 
using random effects. Consider the scenario where we 
have i strata (in the above example we have four strata). 
A stratified analysis using random effects could then be 
performed using the following model: 

f{Y) = a+Uj+P,,^^,X,reat (3) 

where Uj is a random effect for the /th stratum, uj would 
generally be assumed to follow a normal distribution. This 
assumption is likely to be violated in many scenarios, but 
previous research has shown that the fixed parameters 
from random effects models are robust to misspecification 
of the random effects distribution [14,22]. 

As mentioned previously, some types of stratified ana- 
lyses can only be performed for specific outcome types. 
Two examples of this are a Mantel-Haenszel analysis 
and a stratified Cox model. Mantel-Haenszel applies to 
binary outcomes, and involves calculating an odds ratio 
within each stratum, then calculating a weighted average 
of the results to get a final estimate. A stratified Cox 
model is performed with time-to-event outcomes, and 
involves calculating a hazard ratio within each stratum, 
then combining the results for a final estimate. 

For continuous outcomes, the interpretation of the 
treatment effect will not depend on the method of ana- 
lysis. However, for binary or time-to-event outcomes, 
adjustment for different factors can lead to different esti- 
mates [23,24], and therefore must be interpreted based 
on the adjustment factors. If the interactions between 
prognostic factors are large, a stratified-analysis may lead 
to larger estimates of treatment effect compared with a 
covariate-adjusted analysis, and may therefore increase 
power [23,24]. 

Simulation study based on theoretical data 

We performed a simulation study to determine (i) if a 
stratified analysis is necessary after stratified randomisa- 
tion to maintain correct type I error rates, and (ii) 
whether a covariate-adjusted analysis is adequate after 



minimisation for correct type I error rates (rendering a 
stratified analysis unnecessary). 

We used two different methods of randomisation; (i) 
stratified permuted blocks, with a block size of 2, and 
(ii) minimisation with a random component of 80%. For 
each method of randomisation we performed two ana- 
lyses; (a) covariate-adjusted analysis, using equation 1, and 
(b) stratified analysis using fixed effects, using equation 2 
(which, as noted previously, is the same as model (4), 
and so is equivalent to the data generating model). We 
therefore assessed four randomisation-analysis combi- 
nations in total: 

a) Stratified permuted blocks, with a covariate-adjusted 
analysis 

b) Stratified permuted blocks, with a stratified analysis 

c) Minimisation, with a covariate-adjusted analysis 

d) Minimisation, with a stratified analysis 

Our hypothesis was that using a covariate-adjusted 
analysis after stratified randomisation would lead to in- 
flated type I error rates when there were substantial in- 
teractions between prognostic factors, but that a 
stratified analysis would lead to nominal type I error 
rates. Conversely, we hypothesised that a covariate- 
adjusted and a stratified analysis would lead to nominal 
type I error rates after minimisation. 

We generated continuous outcomes from the follow- 
ing model (which in this scenario is equivalent to model 
(2) above): 

yi = CC + fitreat^treat + ^i^l + ^2^2 + ^12^12 + 

(4) 

where Yt is the outcome from the rth patient, Xi and X2 
are balancing factors and X12 is their interaction, yS^ and ^82 
are the regression coefficients for those balancing factors, 
and P12 is the regression coefficient for their interaction, e, 
is a random error term, and is normally distributed with 
mean 0 and variance 0^. 

We performed two sets of simulations. In the first, we 
varied the size of while holding the other parameters 
constant. In the second, we varied the proportion of pa- 
tients with X12 = 1 while holding the other parameters 
constant. More information on both of these scenarios is 
available below. We set the sample size to 250 patients, 
and used 5000 replications for each scenario to give a 
standard error of about 0.3% when estimating the type I 
error rate, assuming a true type I error rate of 5%. 

Varying the size of the interaction term 

For the first set of simulations we varied the size of Iii2 
while holding the other parameters constant. We varied 
Iix2 from 0 to 3 in increments of 0.2. j6j and were set 
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to 0.5, fitreat was Set to 0 and to 1. We set V{Xi = 1) = 
P(X2 = 1) = 0.5, and generated and X2 independently. 
It follows that P{Xi2 = 1) = 0.25. 

Varying the distribution of patients across strata 

For the second set of simulations we varied the propor- 
tion of patients with X12 = 1 while holding the other pa- 
rameters constant. We did this by varying V{Xj = 1) and 
P(X2= 1) together from 0.1 to 0.5 in increments of 0.05. 
This corresponds to varying P(Xj2 = 1) from 0.01 to 0.25. 
All other parameters were set to the same values as 
above, except ySj2 which was set to 1.5 (which is 50% lar- 
ger than a, and is unlikely to occur often in practice, but 
is used here for the purposes of illustration). 

Simulation study based on real trial data 
Metiiods 

We performed a simulation study based on real datasets 
to (1) determine whether stratified randomisation is ne- 
cessary to maintain correct type I error rates after strati- 
fied randomisation in real trial scenarios, and (2) to 
compare different methods of adjustment in terms of 
type I error rate and power, irrespective of the method 
of randomisation used. We used three datasets (one each 
with a continuous, binary, and time-to-event outcome), 
which are further described below. 
When generating data, we used the linear predictor: 

c c 

c—1 c— 1 d>c 

where denotes the main effect from the cth covariate, 
and l^cd denotes the two-way interaction between the cth 
and dth covariates (for d). More information on the 
exact data generating models can be found below. 

For simplicity, we chose to simulate data based only 
on the two-way interactions between covariates, rather 
than including any three-way or higher interactions. It 
should be noted that although we simulated data based 
only on the two-way interactions, stratified analyses were 
performed adjusting for all interactions (or, equivalently, 
all strata), rather than only the two-way interactions. 

We generated prognostic variables from a multivariate 
normal distribution with a covariance matrix based on 
the original data set so that the proportion of patients in 
each stratum was similar to the original study. We then 
categorised binary covariates using a cut-point specified 
to give the desired proportions in each group. 

We randomised patients to one of two treatments 
using three different methods: (1) simple randomisation, 
where all patients had a 50% chance of either treatment; 
(2) stratified permuted blocks, with a block size of 4; 
and (3) minimisation, with a random element of 80% 



(i.e. patients were assigned to the preferred treatment 
arm with a probability of 80%). 

As above, we used 5000 replications for each scenario. 
We compared different analysis methods in terms of the 
type I error rate and power. For continuous, binary, and 
time-to-event outcomes, the treatment effect was calcu- 
lated as a difference in means, an odds ratio, and a haz- 
ard ratio respectively. To assess the type I error rate, we 
set jStreat to 0. To asscss power, we set /Streat to give 80% 
power based on the specified sample size (for binary and 
time-to-event outcomes, we powered the study based on 
reducing, rather than increasing, the number of events). 

MIST2 (continuous outcome) For continuous out- 
comes, we based our simulations on the MIST2 trial, 
which has been described previously [9,10,25,26]. Briefly, 
MIST2 was a randomised controlled trial assessing 
whether tissue plasminogen activator, deoxyribonoclease, 
or their combination was effective in reducing the size 
of patients' pleural effusion (a continuous outcome). 
Two hundred and ten patients were randomised using 
minimisation, with a random component of 80%. Balan- 
cing variables were the size of the baseline pleural effu- 
sion (greater or less than 30% of the hemithorax), 
whether the patient was purulent, and whether the infec- 
tion was community or hospital acquired. 
We generated data from the following model: 

Yi = rii + Ei 

where Yi is the outcome for the ith patient, tji is the lin- 
ear predictor (as in equation 5), and e, is a random 
error term. The parameters for the covariates and their 
two-way interactions can be found in Table 1. Ei was 
generated from a normal distribution with mean 0 and 
standard deviation 19.1. 

We used sample sizes of 100, 200, 500, and 1000 pa- 
tients. We used three methods of analysis; (a) covariate- 
adjusted analysis (equation 1); (b) stratified analysis 



Table 1 Parameters from the IVIIST2 dataset 



Variable 


Proportion of patients 
with covariate 


Regression 
parameter 


iVIain effects 


Pleural effusion >30% 




67 


-25.4 


Purulence 




49 


1.8 


Hospital infection 




13 


-7.0 


Interaction 


Pleural effusion X purulence 




29 


3.5 


Pleural effusion X hospital 
infection 




7 


24.4 


Purulence X hospital infection 




6 


-1.6 



Kahan and Morris BMC Medical Research Methodology 201 3, 1 3:99 
http://www.biomedcentral.com/1471-2288/13/99 



Page 5 of 1 1 



using fixed effects (equation 2); and (c) stratified analysis 
using random effects (equation 3). 

AUGIB (binary outcome) For binary outcomes, we 
based our simulations on the Acute Upper Gastrointestinal 
Bleeding (AUGIB) audit dataset. This dataset has been de- 
scribed previously [27-30]. Briefly, this was an observa- 
tional dataset collected on consecutive patients presenting 
with AUGIB in the UK. We used further bleeding as an 
outcome. We chose four prognostic factors; urea (as a bin- 
ary covariate, dichotomised at its median), presence of 
shock, prolonged coagulation, and outpatient at admission. 
The dataset comprised of 4342 patients with complete data 
on the above covariates. 

We generated latent outcomes using the following 
model: 

Y* = ri, + e, 

where Yi is a latent outcome for j'th patient, ^, is the lin- 
ear predictor (as in equation 5), and e, is a random error 
term that follows a logistic distribution with mean 0 and 
variance tt^/3. Binary responses were generated as = 1 
if Yi > 0, and 0 otherwise. The regression parameters for 
the covariates and their two-way interactions can be 
found in Table 2. 

We used sample sizes of 100, 200, 500, 1000, and 2000 
patients. We used four methods of analysis; (a) covariate- 
adjusted analysis (equation 1); (b) stratified analysis using 
fixed effects (equation 2); (c) stratified analysis using ran- 
dom effects (equation 3); and (d) Mantel-Haenszel esti- 
mates (a type of stratified analysis). 

The event rate was approximately 23.6% in the control 
arm for all scenarios, except for a sample size of 100 
where we used an event rate of about 30.1% in order to 



Table 2 Parameters from the AUGIB dataset 


Variable 


Proportion of patients 
with covariate 


Odds ratio 


Main effects 


Outpatient 


83 


0.32 


Shock 


36 


2.83 


Urea > 9.1 


55 


1.88 


Coagulation 


13 


1.87 


Interaction 


Outpatient X shock 


30 


1.07 


Outpatient X urea 


44 


1.67 


Outpatient X coagulation 


10 


1.97 


Shock X urea 


24 


071 


Shock X coagulation 


6 


0.79 


Urea X coagulation 


9 


077 



ensure an adequate number of events occurred in the 
treatment arm when assessing power. 

PEC (time-to-event outcome) For time-to-event out- 
comes, we based our simulations on the PBC trial. This 
dataset has been described previously [9,31]. Briefly, this 
was a randomised trial assessing whether D-penicillamine 
could increase overall survival time (primary outcome) in 
patients with primary biliary cirrhosis. We chose four 
prognostic factors; age, log(bilirubin), albumin (all as bin- 
ary covariates, dichotomised at their medians), and disease 
stage (1/2 vs 3/4). The dataset comprised of 312 patients. 

We generated time to event outcomes using the 
method described by Bender et al. [32]: 

Yi = Hl'[-HU)e^p{-ri)] 

where is the time to death, ^, is the linear predictor 
(as in equation 5), Hq is the cumulative baseline hazard 
function, and i/~ Uniform (0, 1). This model implies 
proportional hazards. We censored event times at a cut- 
point specified to give a similar proportion of censoring 
as seen in the original dataset. The event rate was ap- 
proximately 40.4% in the control arm for all scenarios. 
The regression parameters for the covariates and their 
two-way interactions can be found in Table 3. 

We used sample sizes of 100, 200, 500, 1000, and 2000 
patients. We used four methods of analysis; (a) covariate- 
adjusted analysis (equation 1); (b) stratified analysis using 
fixed effects (equation 2); (c) stratified analysis using ran- 
dom effects (equation 3); and (d) a stratified Cox model 
(a type of stratified analysis). 

Sensitivity analysis We performed a sensitivity analysis 
to assess whether increasing the size of the observed in- 
teractions in the MIST2, PBC, and AUGIB datasets had 



Table 3 Parameters from the PBC dataset 

Variable Proportion of patients Hazard 

with covariate ratio 

Main effects 

Age > 50 years 51 3 01 

Log(bilirubin) > 0.3 50 10.65 

Albumin > 3.55 50 0.58 

Disease stage > 3 73 5.02 

Interaction 

Age X log(bilirubin) 24 1.03 

Age X albumin 23 1.40 

Age X disease stage 38 0.66 

Log(bilirubin) X albumin 18 0.77 

Log(bilirubin) X disease stage 42 0.45 

Albumin X disease stage 32 0.55 
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any impact on type I error rates. Simulations were 
performed as above, but we systematically increased the 
size of each interaction term by of a factor of 2.5, 5, 7.5, 
and 10. For example, the size of the observed interaction 
terms in the MIST2 dataset (Table 1) was -1.6, 3.5, and 
24.4. Increasing these interactions by a factor of 2.5 for 
this sensitivity analysis led to interaction sizes of -4.0, 8.8, 
and 61.0 respectively. 

Results 

Simulation study based on theoretical data 
Varying the size of ttie interaction term 

Results are shown in Figure la. When patients were 
randomised using minimisation, both a covariate adjusted 
analysis and a stratified analysis gave valid type I error rates, 
regardless of the size of the interaction. Likewise, when a 
stratified analysis was used after patients were randomised 
using stratified permuted blocks, error rates were nominal. 
However, a covariate adjusted analysis gave incorrect type I 
error rates whenever the interaction was ^ 0. 

This demonstrates that, in principle, when randomisa- 
tion has been balanced within strata, a stratified analysis 
may be necessary to maintain nominal type I error rates 
when there are large interactions between balancing fac- 
tors. For minimisation, which does not balance within 
strata, either a stratified or a covariate adjusted analysis 



will give valid type I error rates in the presence of large 
interactions. 

Varying the distribution of balancing factors 

Results are shown in Figure lb. The impact of a covariate 
adjusted analysis after randomisation using stratified 
blocks on the type I error rate depended on the propor- 
tion of patients with X22 = 1; when this was small, type I 
error rates were close to nominal. However, as this in- 
creased, the type I error rates became too low. 

This demonstrates that it is not only the size of the 
interaction which could impact the type I error rate 
under a covariate adjusted analysis, but also the distribu- 
tion of patients across the different strata. 

Simulation study based on real trial data 
IVIIST2 (continuous outcome) 

Results after stratified permuted blocks and minimisation 
are shown in Figure 2. As expected, all methods of analysis 
(covariate-adjusted analysis, stratified analysis using fixed 
effects, and a stratified analysis using random effects) gave 
close to nominal type I error rates after simple randomisa- 
tion (results not shown) or minimisation. Stratified ana- 
lyses (using either fixed or random effects) gave valid 
results after randomisation using stratified permuted 
blocks. However, a covariate adjusted analysis also gave 



7.5 



^ 5 



2.5 




1 2 

P (interaction effect) 




.05 



.1 .15 

P(X„=1) 



.25 



SPB, covariate adjusted 
Min, covariate adjusted 



SPB, stratified analysis 
IVlin, stratified analysis 



Figure 1 Type I error rate with different randomisation and analysis methods. (A) shows the type I error rate for different values of the 
interaction term between the two balancing factors. (B) shows the type I error rate for different values of P(X,2= 1) (which indicates the proportion 
of patients in the interaction group i.e. with both prognostic factors present). Two methods of randomisation were used (Min = minimisation, 
SPB = stratified permuted blocks), and two methods of analysis were used (covariate adjusted indicates the two balancing factors were used as 
covariates in a regression model; stratified analysis denotes that the strata formed from the combinations of the two balancing factors were 
entered as covariates in a regression model). 
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Figure 2 Type I error rate and power results for the IVIIST2 trial (continuous outcome). 



close to nominal type I error rates after stratified per- 
muted blocks, contradicting results seen earlier. 

A stratified analysis using random effects gave a small 
increase in power compared to either a covariate adjusted 
analysis or a stratified analysis using fixed effects with a 
sample size of 100 (approximately 1-2% across different 
randomisation methods). For larger sample sizes (between 
200 and 1000 patients), stratified analyses using either 
fixed or random effects had similar levels of power, and 
were slightiy more powerful than covariate adjusted ana- 
lysis (approximately 1.5%). 

Each method of analysis had a convergence rate of 
100% in all scenarios. 



AUGIB (binary outcome) 

Results after stratified permuted blocks and minimisation 
are shown in Figure 3. For larger sample sizes (500 or 
more patients), all analysis methods gave similar type I 
error rates and power; as above, a covariate-adjusted ana- 
lysis gave correct type I error rates, even when used after 
stratified permuted blocks. Convergence rates were greater 
than 99% for all analysis methods. 

For smaller sample sizes (100 or 200 patients), results for 
the different analysis methods were less similar. With 100 
patients, all analysis methods apart from a stratified analysis 
using random effects had convergence issues; convergence 
rates for a covariate-adjusted analysis, a stratified analysis 
using fixed effects, and Mantel-Haenszel varied between 
96-97%, whereas rates for a stratified analysis using random 



effects were >99%. All convergence rates were >99% with 
200 patients. 

The type I error rate for Mantel-Haenszel was too low 
with a sample size of 100 patients; this lead to a small loss 
in power (between 2-5% compared with a stratified ana- 
lysis using random effects). With 200 patients, Mantel- 
Haenszel experienced a loss in power of 1-5% compared 
with a stratified analysis using random effects. Type I error 
rates and power results between a covariate-adjusted ana- 
lysis and stratified analyses using either fixed or random 
effects were similar for 100 or 200 patients. 

PBC (time-to-event outcome) 

Results after stratified permuted blocks and minimisation 
are shown in Figure 4. Convergence rates were above 
99.9% for each method of analysis in all scenarios. For 
large sample sizes (1000 or 2000 patients) each method of 
analysis gave similar results; as above, a covariate-adjusted 
analysis gave correct type I error rates when used with 
stratified permuted blocks. 

For other sample sizes (100, 200, and 500 patients) a 
stratified analysis using fixed effects had type I error 
rates that were too large (range across three different 
randomisation methods 7.4-8.5%, 6.2-6.7%, and 5.5- 
5.8% for 100, 200, and 500 patients respectively). Type I 
error rates were slightly too large for covariate-adjusted 
analyses, although less so than for stratified analyses 
using fixed effects (range across different randomisation 
methods 5.4-6.0%, 5.5-5.7%, and 5.1-5.7% for 100, 200, 
and 500 patients respectively). Conversely, stratified 
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Figure 3 Type I error rate and power results for the AUGIB dataset (binary outcome). 



analyses using random effects and stratified Cox models 
gave nominal type I error rates (range across random- 
isation methods and sample sizes 4.7-5.5% and 4.4-5.6% 
for 100-500 patients for stratified analyses using ran- 
dom effects and stratified Cox models respectively). 

Stratified analyses using fixed effects and covariate- 
adjusted analyses had highest power for smaller sample 



sizes, although this is likely a result of the inflated type I 
error rate associated with these analysis methods. Of the 
two analysis methods that gave nominal type I error rates 
for smaller sample sizes, stratified analyses using random 
effects had higher power than stratified Cox models 
(power increases of 3.6-5.5% and 1.7-2.5% for sample sizes 
of 100 and 200 respectively). 
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Figure 4 Type I error rate and power results for the PBC trial (time-to-event outcome). 
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Sensitivity analysis 

Results are shown in Figure 5. For MIST2, the type I 
error rate was too low when the size of the observed in- 
teractions was increased by a factor of at least 2.5. For 
the PBC and AUGIB datasets, type I error rates were 
not substantially affected until the interactions had been 
increased by a factor of at least 7.5. 

To be these results in perspective, increasing the ob- 
served interactions from the MIST2 dataset by a factor 
of 2.5 resulted in the effect size of the largest interaction 
being increased to almost 60 (approximately 3 times lar- 
ger than the residual standard deviation). Increasing the 
observed interactions in the PBC and AUGIB datasets 
by a factor of 7.5 resulted in the odds ratio or hazard ra- 
tio of the largest interactions being increased to almost 
399 and 162 respectively. 

Discussion 

Our aims for this paper were to (1) determine whether it 
is necessary to perform a stratified analysis after using a 
randomisation method that balances within strata (such 
as stratified permuted blocks) to obtain correct type I 
error rates, and (2) to compare different methods of ac- 
counting for multiple prognostic factors in terms of 
power and type I error rates, irrespective of the method 
of randomisation. 

Regarding point (1), it has previously been noted that 
for randomisation methods that balance within each 
stratum (e.g. stratified permuted blocks), it may be neces- 
sary to use a stratified analysis to obtain correct type I 
error rates when there are large interactions between bal- 
ancing factors [9]. By comparison, this issue should not 
affect randomisation methods that do not balance within 



strata (e.g. simple randomisation, permuted blocks with- 
out stratification, or minimisation), and so both covariate- 
adjusted and stratified analyses should give correct results. 
We explored this issue using simulation under specific 
(and potentially unrealistic) conditions, and found that the 
necessity of a stratified analysis after stratified randomisa- 
tion depended on (a) the size of the interactions between 
prognostic factors, and (b) the distribution of patients 
across strata. When there were both large interactions, 
and a relatively equal number of patients in each stratum, 
a covariate-adjusted analysis led to type I error rates that 
were too low. A stratified analysis by comparison gave 
valid results. However, when either the interactions were 
small, or there was a low percentage of patients in some 
strata, a covariate-adjusted analysis gave close to nominal 
type I error rates. As expected, both analysis methods gave 
valid results after minimisation. This is because minimisa- 
tion balances baseline variables marginally, meaning that 
although variables are balanced, their interactions are not. 
Adjustment for the main effects will then be sufficient to 
obtain the nominal type I error rates. 

In order to determine whether this issue was lil<ely to 
affect real RCTs, we performed further simulations based 
on real data. Contrary to expectations, we found that a 
stratified analysis was not necessary after stratified ran- 
domisation; covariate-adjusted analyses lead to vahd re- 
sults in each of the three datasets we used. The reasons 
for this are not entirely clear. One possible explanation is 
that the interaction sizes we used (based on observed data) 
were not large enough to affect results. However, some of 
the interactions we used were substantial. For example, in 
the MIST2 trial one of the interactions was 28% larger 
than the standard deviation, in the AUGIB dataset one 



S 3 

0) 
Q. 

.i- 2 



• MIST2 ♦ PBC 



AUGIB 




2.5 



5 7.5 
IVIultiple of observed interactions 



10 
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interaction had an odds ratio of 2.01 and another an odds 
ratio of 1.65, and in the PBC dataset one interaction had a 
hazard ratio of 0.45 and two others a hazard ratio of 0.67. 
Another explanation is that the distribution of patients 
across strata affected the results (i.e. some strata had too 
low a proportion of patients). This indicates that in prac- 
tice, both large interactions between balancing factors and 
similar numbers of patients in most strata are necessary 
for a covariate-adjusted analysis to affect type I error rates 
after balancing within strata. However, the second condi- 
tion may be unlikely; this would require a similar propor- 
tion of patients in each group for all balancing factors (i.e. 
close to 50% of patients in each level of a binary factor), as 
well as small correlations between balancing factors (as 
moderate to large correlations would lead to patients be- 
ing much more likely to fall into certain stratum). In our 
view, these conditions seem unlikely to be met in practice. 
Given the imbalance in the number of patients in each 
stratum observed in the datasets, the size of the interac- 
tions would have to have been 2.5-7.5 times larger than 
they were in order to affect results. For the MIST2 trial 
for example, this would have required an interaction term 
of almost 60 (about three times larger than the standard 
deviation), which is not realistic in practice. We conclude 
that choosing between covariate-adjusted and stratified 
analyses does not need to be based on whether stratified 
randomisation was used. 

This brings us to our second question; of the numerous 
methods of analysis available, which is most powerful, irre- 
spective of the randomisation method used? For continu- 
ous outcomes, stratified analyses (either fixed or random 
effects) gave slightly higher power than a covariate-adjusted 
analysis, while all methods of analysis gave nominal type I 
error rates. 

For binary and time-to-event outcomes, all methods of 
analysis gave similar results with large sample sizes. How- 
ever, there were differences between analysis methods for 
small sample sizes. A stratified analysis using fixed effects 
led to inflated type I error rates in several scenarios, and 
cannot be recommended. A covariate-adjusted analysis 
also led to type I error rates that were too large when used 
with a time-to-event outcome; this is similar to results 
seen previously [9] where accounting for several balancing 
factors led to inflated type I error rates with a binary or 
time-to-event outcome. 

Both Mantel-Haenszel for binary outcomes and a strati- 
fied Cox model for time-to-event outcomes gave close to 
nominal type I error rate (though Mantel-Haenszel was 
slightly too low in certain scenarios), but both suffered 
from a lack of power compared with other methods. 

The one method of analysis which gave good results 
across all scenarios and sample sizes was a stratified ana- 
lysis using random effects. While other analysis methods 
gave inflated type I error rates (stratified analysis using 



fbced effects, covariate-adjusted analysis) or led to a loss 
of power (Mantel-Haenszel, stratified Cox model) with a 
binary or time-to-event outcome and a small sample 
size, a stratified analysis using random effects gave nom- 
inal type I error rates and good power. 

In this paper we have only considered methods of ad- 
justment for prognostic covariates, and have not discussed 
ways to account for centre effects in multicentre RCTs. 
Comparison of methods for adjusting for centre effects 
in multicentre RCTs have been published previously 
[14,17,18]. However, many trials may adjust for both prog- 
nostic covariates and centre effects, and it is therefore 
worth considering whether the methods we have discussed 
in this paper will apply when also accounting for centre ef- 
fects. A stratified analysis could be performed by account- 
ing for the strata made up of all the centre-prognostic 
covariate combinations. However, if the number of centres 
is large, this may lead to over-stratification, which could 
lead to a loss of power. Therefore, we do not recommend 
this approach in general. An alternative approach is to 
perform a covariate-adjusted analysis for the prognostic 
factors (provided the sample size is large enough), and to 
account for centre-effects separately (e.g. using fixed or 
random effects, or generalised estimating equations). Al- 
ternatively, if centre effects are accounted for using fbced 
effects, a stratified analysis using random effects could be 
used for the prognostic factors. 

One limitation of this paper is that we have dichotomised 
the continuous covariates from the original datasets for 
use in our simulation study. This has been done because a 
stratified analysis is only possible with categorical covari- 
ates. However, we would generally not recommend cate- 
gorising continuous variables in practice. Provided the 
sample size is large enough, we would recommend ac- 
counting for continuous variables as covariates in a regres- 
sion model. If there is a mixture of continuous and 
categorical variables, we could either perform a covariate- 
adjusted analysis for all variables, or perform a mixture of a 
covariate-adjusted and a stratified analysis, where the cat- 
egorical covariates are grouped into strata and accounted 
for using random effects, and the continuous variables are 
included as covariates in the regression model. 

Conclusion 

It is unlikely that a stratified analysis is necessary after 
stratified randomisation except in extreme scenarios. 
Therefore, the method of analysis (accounting for the 
strata, or adjusting only for the covariates) will not generally 
need to depend on the method of randomisation used. All 
of the methods of analysis considered in this article are ac- 
ceptable with a continuous outcome, although when there 
are large interactions between covariates, a stratified ana- 
lysis may increase power. With a binary or time-to-event 
outcome and a small sample size, we recommend the use 



Kahan and Morris BMC Medical Research Methodology 201 3, 1 3:99 
http://www.biomedcentral.com/1471-2288/13/99 



Page 11 of 1 1 



of a stratified analysis using random effects, as this has been 
shown to maintain nominal type I error rates while giving 
high power. For binary or time-to-event outcomes with a 
large sample size, all methods of analysis are acceptable; 
however, it is often unclear what constitutes a large sample 
size. Therefore, if in doubt, we recommend the use of a 
stratified analysis using random effects to ensure correct 
type I error rates and power. 
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